Hardware-Supported Fault Tolerance for Multiprocessors
Abstract
To make a computing system dependable, fault tolerance mechanisms have to be included. Massive parallelism in particular poses a new challenge for fault tolerance. In this paper we discuss basic hardware fault tolerance measures for massively parallel multiprocessors, as well as solutions realized for and integrated into different multiprocessor architectures. Furthermore, we present our technique for validating dependability, which is based on simulation-based fault injection.
Similar resources
Fault-Tolerance in Augmented Hypercube Multicomputers
This paper describes different schemes for tolerating faults in augmented hypercube multiprocessors. The architectures considered have a spare assigned to each subset of nodes (cluster). The approaches make use of hardware redundancy in the form of spare nodes and/or links and usually require modifications to the communication as well as the computation algorithms.
Exploring Salvage Techniques for Multi-core Architectures
As process technology scales, both fabrication-induced and in-operation hard faults will become more prevalent, limiting yield and effective product lifetime. The simultaneous emergence of chip multiprocessors (CMPs) and the revitalization of machine virtualization offer several opportunities for hard failure tolerance. In this paper, we provide preliminary analysis of methods for lifetime recover...
Fault-Tolerance with Multimodule Routers
Current multiprocessors such as the Cray T3D support interprocessor communication using partitioned dimension-order routers (PDRs). In a PDR implementation, the routing logic and switching hardware is partitioned into multiple modules, with each module suitable for implementation as a chip. This paper proposes a method to incorporate fault tolerance into such routers with simple changes to the rou...
Dynamic Verification of Cache Coherence Protocols
A method for improving the fault tolerance of cache-coherent multiprocessors is proposed. By dynamically verifying coherence operations in hardware, errors caused by manufacturing faults, soft errors, and design mistakes can be detected. Analogous to the DIVA concept for single-processor systems, a simple version of the protocol functions as a checker for the aggressive implementation. An exampl...
Fault Injection Based Validation of Fault-Tolerant Multiprocessors
One of the most crucial tasks in the design of fault-tolerant computers is the validation of the built-in error detection and handling mechanisms. Pre-design validation techniques, like performability modelling and analysis, often require information such as exact failure rates, which is usually unavailable to the user. Moreover, the majority of computer failures originate from transient faults,...